This section has a series of boxplots that compare the performance of anonymous volunteers, registered volunteers, a single turker, and 5 turkers with majority vote. Here, the single turker is the first turker to complete that set of routes.
For each granularity and label type, an ANOVA test was run, with the p-value reported in the top-right corner of each boxplot. I appended a ** if the p-value was less than 0.01, and just a * if the p-value was less than 0.05, to make finding significant results easier.
To make room for the p-value on each boxplot, I extended the y-limit past 1.0, and then added a dotted line at the 1.0 mark so that we still have that reference point. However, it makes the graphs look a bit ugly, so LMK if you want to remove it.
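The ANOVA-plus-stars annotation can be sketched like this (the group scores below are made-up placeholders, not our actual data; the real version runs on the per-route accuracy values for each group):

```python
# Sketch of the significance annotation on each boxplot. Uses
# scipy.stats.f_oneway for the one-way ANOVA across the four groups.
from scipy.stats import f_oneway

def significance_marker(p_value):
    """Marker appended to the p-value label: ** if p < 0.01, * if p < 0.05."""
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""

# Hypothetical accuracy scores for the four groups in one boxplot panel.
anonymous  = [0.61, 0.58, 0.64, 0.60, 0.62]
registered = [0.70, 0.72, 0.68, 0.71, 0.69]
turker1    = [0.55, 0.57, 0.53, 0.56, 0.54]
turkers5   = [0.66, 0.65, 0.67, 0.64, 0.68]

f_stat, p = f_oneway(anonymous, registered, turker1, turkers5)
label = f"p = {p:.3f}{significance_marker(p)}"  # text drawn in the top-right corner
```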
Defined as \(\frac{TP + TN}{TP + TN + FP + FN}\). Just the percentage of things they got correct.
Defined as \(\frac{TP}{TP + FN}\). High recall means that they found most of the issues/features.
Defined as \(\frac{TP}{TP + FP}\). High precision means that they rarely placed a label when they shouldn’t have.
Defined as \(2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\). It is essentially a balance between recall and precision.
Defined as \(\frac{TN}{TN + FP}\). Similar to precision, high specificity means that they rarely placed a label when they shouldn’t have, but specificity gives more weight to true negatives, while precision gives more weight to true positives.
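All five measures above come straight from the confusion-matrix counts; for reference, here they are as code (the counts at the bottom are made up for illustration):

```python
# The five accuracy measures, computed from raw TP/TN/FP/FN counts.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, fn):
    return tp / (tp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def f1(tp, fp, fn):
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)

def specificity(tn, fp):
    return tn / (tn + fp)

# Hypothetical counts for one worker / label type / granularity.
tp, tn, fp, fn = 40, 30, 10, 20
accuracy(tp, tn, fp, fn)   # -> 0.70
precision(tp, fp)          # -> 0.80
specificity(tn, fp)        # -> 0.75
```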
In this section, there are a series of histograms that help to visualize the distribution of volunteers’ accuracy. For each accuracy measure, there is a grid of histograms split by label type and granularity (street, 5 meter, 10 meter).
Note that these histograms have lines representing the mean of each group (not the median; lmk if you want to see median instead).
Defined as \(\frac{TP + TN}{TP + TN + FP + FN}\). Just the percentage of things they got correct.
Defined as \(\frac{TP}{TP + FN}\). High recall means that they found most of the issues/features.
Defined as \(\frac{TP}{TP + FP}\). High precision means that they rarely placed a label when they shouldn’t have.
Note: Very little confidence should be given to precision for the NoSidewalk label, since GT labelers only placed the label at intersections and at places where a sidewalk ends.
Defined as \(2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\). It is essentially a balance between recall and precision.
Defined as \(\frac{TN}{TN + FP}\). Similar to precision, high specificity means that they rarely placed a label when they shouldn’t have, but specificity gives more weight to true negatives, while precision gives more weight to true positives.
In this section, there are a series of line graphs to help visualize how the number of turkers used and the method of voting affect the various accuracy measures. For each accuracy measure, there is a grid of line graphs, split by label type and granularity (same as above). However, each graph also has a line for each of the voting methods. You will notice that all voting methods are equivalent when looking at only one turker.
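A sketch of the aggregation step (the data structures here are assumptions, not the actual pipeline): each voting method reduces to a threshold on how many of the n turkers placed a label at a location, which is also why the lines coincide at n = 1, since every threshold rule agrees when there is a single vote.

```python
# Threshold-based vote aggregation across turkers. "Majority vote" labels a
# location when more than half of the turkers labeled it; other methods just
# use a different cutoff (e.g., 1 for "any turker", n for "all turkers").
def threshold_vote(votes, threshold):
    """votes: list of 0/1 per turker; True if at least `threshold` labeled it."""
    return sum(votes) >= threshold

def majority_vote(votes):
    return threshold_vote(votes, len(votes) // 2 + 1)

# Example: 5 turkers, 3 of whom labeled this location.
votes = [1, 0, 1, 1, 0]
majority_vote(votes)        # True: 3 of 5 is a majority
threshold_vote(votes, 4)    # False under a stricter 4-of-5 rule
```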
Defined as \(\frac{TP + TN}{TP + TN + FP + FN}\). Just the percentage of things they got correct.
Defined as \(\frac{TP}{TP + FN}\). High recall means that they found most of the issues/features.
Defined as \(\frac{TP}{TP + FP}\). High precision means that they rarely placed a label when they shouldn’t have.
Defined as \(2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\). It is essentially a balance between recall and precision.
Defined as \(\frac{TN}{TN + FP}\). Similar to precision, high specificity means that they rarely placed a label when they shouldn’t have, but specificity gives more weight to true negatives, while precision gives more weight to true positives.
In this section, we are looking at how accuracy is affected when we remove low severity labels from the GT set. The idea is that higher severity problems (as defined by the GT labelers) will be easier for crowd workers to see, which should give higher recall. This would be a nice result to have: we could say that crowd workers can at least find the most severe problems. The two expected outcomes here are for recall to go up and for precision to go down. I’ve included all accuracy types for now in case we see anything interesting.
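The severity filter can be sketched as follows (label locations and severities are made-up placeholders; the real analysis works on the matched GT/worker label sets). Note how dropping low-severity GT labels removes misses (raising recall) while any worker labels that matched the dropped GT labels become false positives (lowering precision):

```python
# Drop GT labels below a severity cutoff, then recompute recall/precision
# against the reduced GT set.
def filter_gt(gt_labels, min_severity):
    """gt_labels: list of (location, severity) pairs; keep high-severity ones."""
    return {loc for loc, sev in gt_labels if sev >= min_severity}

def recall_precision(worker_locs, gt_locs):
    tp = len(worker_locs & gt_locs)
    fn = len(gt_locs - worker_locs)
    fp = len(worker_locs - gt_locs)
    return tp / (tp + fn), tp / (tp + fp)

gt = [("a", 5), ("b", 2), ("c", 4), ("d", 1)]   # GT labels with severities
worker = {"a", "b", "c", "e"}                    # locations a worker labeled

r_all, p_all = recall_precision(worker, {loc for loc, _ in gt})
# Keeping only severity >= 3 drops "b" and "d" from GT: the missed "d" no
# longer counts as a FN (recall up), but the worker's "b" is now a FP
# (precision down).
r_high, p_high = recall_precision(worker, filter_gt(gt, 3))
```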
Defined as \(\frac{TP + TN}{TP + TN + FP + FN}\). Just the percentage of things they got correct.
Defined as \(\frac{TP}{TP + FN}\). High recall means that they found most of the issues/features.
Defined as \(\frac{TP}{TP + FP}\). High precision means that they rarely placed a label when they shouldn’t have.
Defined as \(2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\). It is essentially a balance between recall and precision.
Defined as \(\frac{TN}{TN + FP}\). Similar to precision, high specificity means that they rarely placed a label when they shouldn’t have, but specificity gives more weight to true negatives, while precision gives more weight to true positives.